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We analyze the parallel performance of randomized interpolative decomposition by de- 
composing low rank complex- valued Gaussian random matrices larger than 100 GB. We 
chose a Cray XMT supercomputer as it provides an almost ideal PRAM model permit- 
ting quick investigation of parallel algorithms without obfuscation from hardware idiosyn- 
crasies. We obtain that on non-square matrices performance becomes very good, with 
overall runtime over 70 times faster on 128 processors. We also verify that numerically 
discovered error bounds still hold on matrices two orders of magnitude larger than those 
previously tested. 
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1 Introduction 

Computational mathematics and science increasingly involves calculations using very large matrices which 
may easily be hundreds of gigabytes in size. Nonetheless, many fundamental matrix algorithms are not 
only very slow, with time complexities of 0(n 3 ) common, but they can also scale poorly on parallel 
machines. 

A few years ago, it was proposed in a series of papers [11, 7, 5] that a procedure, coined interpolative 
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decomposition (ID), could decompose an m x n matrix, with approximate rank k, into two much smaller 
matrices. Furthermore, when using a probabilistic algorithm to perform the ID, one had quadratic scaling 
behavior and much of the algorithm could be naturally parallelized. Performing an ID on a large low-rank 
matrix not only allows for it to be stored in a much smaller amount of memory, but it allows for many 
simple operations (such as matrix multiplication) to run significantly faster. 

Although there is increasing interest in probabilistic algorithms (see, e.g., [9, 10, 3]), we are not aware of 
these algorithms having been tested on extremely large matrices where massive parallelization is essential. 
In this paper, we present results obtained on a Cray XMT supercomputer by performing a randomized ID 
on matrices over 100 GB in size, which is two orders of magnitude larger than previous matrices studied 
(without making calls out of RAM). In doing so, we confirm the conjecture that such algorithms can 
run efficiently in parallel, demonstrating that for many matrices, the algorithm's parallelism scales by 
more than two orders of magnitude. We also confirm numerical results which suggested that the error 
bounds on the procedure are largely insensitive to matrix size, implying that these procedures indeed are 
appropriate for high-precision computation on extremely large matrices. 



2 The Randomized Interpolative Decomposition Algorithm 

Let A be a m x n complex matrix with approximate rank k, with k <C m,n. By approximate rank, we 
mean that the (k + l) st largest singular value, &k+i, is small. The goal of the ID algorithm [5] is to find 
a m x k matrix B and a k x n matrix P such that 

A ps BP. (1) 

In fact, we can be more precise [11, 7]: with probability usually no smaller than 1 — 10~ 17 , 

\\A - BP\\ 2 < Vklmncr k+ x. (2) 

However, numerical testing has found that this bound is actually extremely conservative and the algorithm 
often performs substantially better [7]. This is further confirmed as shown in Section 3. 

2.1 Algorithmic Details 

Let us now briefly describe the mathematics and structure behind the algorithm. The intuitive way 
to think of the algorithm is that it "compresses" the matrix A into something with very few rows by 
attempting to extract all "information" about the linear independence of the various rows while removing 
most of the rest. FYom there, it performs highly accurate QR factorizations on a subset of columns: given 
an orthonormal basis for these reduced columns, the remainder of the matrix can be rapidly factored. The 
key to the algorithm is that the only parts of the algorithm which are slow and not efficiently parallelized 
are only run on a very tiny matrix, compared to the size of A. 

The first step of the algorithm corresponds to the "randomization" of the matrix A, which is achieved 
by compressing it into a I x n matrix Y. The parameter I can be chosen by the user; it must satisfy I > k, 
and we chose I = 2k to be safe. The compression is done by writing 



where 



Y = SFDA (3) 

S jk = $js h , (4) 

F jk = e -2^i(i-l)(fc-l)/m ( 5 ) 

D jk = e 2 ^S jk , (6) 
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and 5ij is the Kronecker 5 (1 when its indices are equal, otherwise 0), {si,...,sj} are i.i.d. 1 random 
variables uniformly distributed on {1, . . . , m}, and {4>i, ■ ■ ■ , 4>m} are i.i.d. uniform random variables on 
[0, 1]. In practice, this has a very simple interpretation: the matrix D multiplies each row by a random 
complex phase; the matrix F performs a FFT on each column, and the matrix S simply sets Y equal to 
a matrix consisting of / randomly chosen rows from the Fourier transformed matrix. 

The next step consists of an approximate QR factorization on the matrix Y: i.e., expressing 

Y « QRU T (7) 

where II is a permutation matrix, Q is a I x k matrix with orthonormal columns, R is a k x n matrix of 
the form 

R={ Ri R 2 ) (8) 

where R± is a k x k upper triangular matrix, and R 2 is a k x (n — k) matrix [1]. As with typical QR 
factorizations, the aim here is to find an orthonormal set of vectors to serve as an approximate basis for 
the columns of Y. 

The final step consists of forming B and P. B is found by taking the left-most m x k matrix of AH. 
The matrix P is constructed as follows: first, find the matrix T such that 

R 2 = RiT. (9) 

This problem can be solved exactly because Ri is upper triangular, and it reduces to the problem of 
solving Lv = w for triangular L and unknown v, given w [2]. From here, one sets 

p= ( I t ) n T . (io) 

Overall, the complexity of the algorithm is given by 0(mnlogm + Ik 2 + k 2 n), with the terms corre- 
sponding to the FFT, QR factorization and factorization of R respectively. 

It is worth noting that there are alternative algorithms, but that the algorithm proposed here is likely 
the most generally effective as it works well independent of the structure of the matrix A. However, if a 
faster method of computing the randomization step is available, that should be used instead [11]. 

2.2 Strategies for Implementation 

We now briefly describe the algorithm's intrinsic parallelism. Because the FFT can be performed on each 
column separately, and the factorization of R can also be done individually for each column, these parts 
of the algorithm exhibit both coarse and fine-grain parallelism and thus we found them to scale decently. 

Ultimately, the clear bottleneck in this procedure is the QR factorization, which must be done ex- 
tremely accurately. The typical algorithm to use here is a Gram-Schmidt algorithm. We chose to use a 
classical GS algorithm with iteration - this is the most numerically stable variant of GS [8], and it also 
works well in highly parallel contexts [6], beating out an iterated modified GS [4]. 

3 Numerical Results 

The algorithm was implemented in C with 64-bit precision floating point arithmetic. The matrix A 
was formed by constructing B and P to be Gaussian random matrices with complex entries and the 
appropriate dimensions. 

i.i.d. = independent and identically distributed 
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Figure 1: Parallel efficiency of various processes on the XMT. For all runs, take I = 2k. 



The XMT is a shared-memory, multithreaded machine. There is no cache and no local memory; all 
processors can access all memory locations in the same time. Each processor has 128 registers sets, 128 
program counters (one per register set), and a single, three instruction wide execution pipeline. The 
pipeline can execute one memory and three floating point operations per cycle. Up to 128 different 
software threads can be co-scheduled per processor. On each cycle, a processor chooses a software thread 
with a ready instruction and executes it. As long as one of the 128 threads has a ready instruction 
the processor remains busy. Thus, long latency operations such as memory accesses, synchronization 
operations, or runtime system calls are tolerated via parallelism. The parallel performance and scalability 
of an algorithm is a function of only its parallelism. The Cray XMT is almost an ideal PRAM system 
supporting equally a wide variety of parallel techniques including data and task parallelism, dataflow, and 
recurrsion. In our study, we used data parallelism, collapsing loop nests to generate a sufficient number of 
independent threads to saturate a 128 processor Cray XMT (about 100 threads per processor). In many 
instances, the compiler collapsed the nests for us; but, where it did not we used pragmas to force the 
issue. 

As expected theoretically, the FFT runtime was dominated by m, the GS runtime was dominated by 
k, and the R factorization runtime was dominated by n. Figure 1 shows the parallel efficiency of the 
factorization of R on an increasing number of processors. Note that we are defining parallel efficiency in 
the standard way. We saw parallel efficiencies of over 100 times faster on 128 processors for n = 2 18 . In 
contrast, we saw gains of only about 70 times faster on 128 processes for the FFT. As this component of 
our algorithm ran particularly fast, we can conclude that at least using our implementation, this algorithm 
runs most efficiently on fairly skinny matrices. Note that one can always arrange things easily so that 
n > m by simply taking a transpose, which will not affect the rank. 

The overall efficiency of the algorithm is shown in Figure 2. We saw excellent scaling by the time we 
reached n = 2 18 , except for on 128 processors where for some reason the FFT began performing much 
worse. This did not happen for m > 2 and thus we expect that a more optimized FFT implementation 
would avoid this issue entirely. Table 1 shows the overall runtime in seconds for the various processes on 
each of the runs shown in Figure 2; similarly, Table 2 gives the results for the Gram-Schmidt process, 
Table 3 for the FFT, and Table 4 for the factorization of R. 

As a final note, it is worthwhile to emphasize that as our benchmarking spanned 2 orders of magnitude 
in matrix size, the increase in error on average was only a single order of magnitude: from about ~ 4x 10~ n 
to 10" 9 . Thus, we have found that the numerically determined low bounds on the error in this algorithm 
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Figure 2: Parallel efficiency of the overall algorithm on the XMT. For all runs, take I = 2k. 



matrix parameters 


4 


8 


16 


32 


64 


128 


k 


= 100, to = 2 i4 , n = 


2 14 


56.0 


29.56 


16.61 


13.37 


7.16 


14.04 


k 


= 100, to = 2 16 , n = 


2 14 


188.6 


95.2 


53.52 


34.1 


21.88 


20.46 


k 


= 400, to = 2 16 , n = 


2 14 


381.1 


194.7 


103.3 


56.43 


44.2 


68.4 


k 


= 400, to = 2 18 , n = 


2 14 


923.9 


476.1 


244.7 


133.6 


84.12 


76.83 


k 


= 100, to = 2 16 , n = 


2 16 


732.8 


375.4 


196.5 


129.4 


77.16 


67.87 


k 


= 1000, to = 2 16 , n = 


2 16 


5657 


2873 


1492 


806.6 


455.3 


444.01 


k 


= 400, to = 2 14 , n = 


2 18 


3780 


1919 


1001 


531.4 


363.9 


310 


k 


= 1000, to = 2 14 , n = 


2 18 


20167 


10233 




2894 


1479 


1099 



Table 1: Runtime of the total algorithm, in seconds, for the given parameters and number of processors 
listed in the top row. 



continue to hold up to matrices approaching 1 TB in size. Table 5 shows the errors bounds for some 
sample runs. 



4 Conclusion 

This letter demonstrates that the parallel implementations of randomized algorithms are efficient and scale 
well for very large matrices. These algorithms do tend to become less ideal on 128 processors because 
the FFT is starved, but we emphasize that as our results can be obtained for arbitrary matrices with low 
rank, even the scaling we found is very impressive. 

Furthermore, concerns about the breakdown of the randomized algorithm on matrices approaching 1 
TB in size are unfounded - we have demonstrated that the error in the algorithms are largely insensitive 
to matrix size, assuming that the choice of k is large enough. Finally, we stress that our benchmarking 
was done on dense random matrices - it is quite likely that the scaling would be far better were the 
low rank matrices in question of a more specialized form where certain steps in the algorithm could be 
optimized further. 

We stress that optimizing the Gram-Schmidt and FFT algorithms on highly parallel architectures is 
an active area of research and it is almost assured that specialized code would be faster than ours on these 
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Table 2: Runtime of the FFT, in seconds, for the given parameters and number of processors listed in 
the top row. Note that this is effectively independent of k. 
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Table 3: Runtime of the Gram-Schmidt process, in seconds, for the given parameters and number of 
processors listed in the top row. Note that it is effectively independent of m and n. 

components. However we did not feel it appropriate to emphasize the optimization of these algorithms in 
the preparation of this letter, as our focus was instead on the parallelization of randomized algorithms. 
Work on improving these components of the algorithm, as well as finding specialized random algorithms 
for certain classes of large (low-rank) matrices is a worthwhile direction for future research. 
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